In [36]:
'''
     - Importing libraries
     - Creating data sets
     - Creating data frames
     - Reading from CSV
     - Exporting to CSV
     - Finding maximums
     - Plotting data

Create Data - We begin by creating our own data set for analysis, so that readers of this tutorial do not have to download any files to replicate the results below. We will export this data set to a text file so that you can get some experience pulling data from a text file.
Get Data - We will learn how to read in the text file. The data consist of baby names and the number of babies born with each name in the year 1880.
Prepare Data - Here we will simply take a look at the data and make sure it is clean. By clean I mean we will look inside the contents of the text file for any anomalies. These can include missing data, inconsistencies in the data, or any other data that seems out of place. If any are found, we will have to decide what to do with these records.
Analyze Data - We will simply find the most popular name in a specific year.
Present Data - Through tabular data and a graph, clearly show the end user the most popular name in a specific year.
'''



# Enable inline plotting
%matplotlib inline

# General syntax to import specific functions in a library: 
##from (library) import (specific library function)
from pandas import DataFrame, read_csv

# General syntax to import a library but no functions: 
##import (library) as (give the library a nickname/alias)
import matplotlib.pyplot as plt
import pandas as pd #this is how I usually import pandas
import sys #only needed to determine Python version number
import matplotlib #only needed to determine Matplotlib version number

In [37]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Matplotlib version ' + matplotlib.__version__)


Python version 2.7.13 |Anaconda 4.3.0 (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]
Pandas version 0.19.2
Matplotlib version 2.0.0

Create Data - The data set will consist of 5 baby names and the number of births recorded for each name in the year 1880.


In [38]:
'''
Create Data
'''
# The initial set of baby names and birth counts
names = ['Bob','Jessica','Mary','John','Mel']
births = [968, 155, 77, 578, 973]

In [43]:
BabyDataSet = list(zip(names,births))
BabyDataSet


Out[43]:
[('Bob', 968), ('Jessica', 155), ('Mary', 77), ('John', 578), ('Mel', 973)]

In [45]:
df = pd.DataFrame(data = BabyDataSet, columns=['Names', 'Births'])
df


Out[45]:
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
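
As a side note, the zip-based construction above is not the only way to build this table. The sketch below is an illustration rather than part of the original tutorial, and the variable names df_from_dict and df_from_zip are made up for it. It builds the same frame from a dict that maps each column name to its list of values, then compares the two results.

```python
import pandas as pd

names = ['Bob', 'Jessica', 'Mary', 'John', 'Mel']
births = [968, 155, 77, 578, 973]

# A dict maps each column name directly to its list of values.
# columns= fixes the column order explicitly.
df_from_dict = pd.DataFrame({'Names': names, 'Births': births},
                            columns=['Names', 'Births'])

# The zip-based construction used in the tutorial, for comparison.
df_from_zip = pd.DataFrame(data=list(zip(names, births)),
                           columns=['Names', 'Births'])

# equals() compares shape, values, and dtypes.
same = df_from_dict.equals(df_from_zip)
print(same)
```

The dict form avoids the intermediate list of tuples; the zip form makes the row-by-row pairing explicit. Either should work for a small data set like this one.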

In [50]:
'''
Get Data
'''
df.to_csv?

In [51]:
df.to_csv('births1880.csv',index=False,header=False)
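
To see what index=False and header=False change, the following sketch writes a shortened two-row sample to in-memory buffers instead of a file and compares the text. It assumes Python 3 (io.StringIO accepts str there), which is newer than the interpreter shown in the version output above; the names bare and full are illustrative only.

```python
import io
import pandas as pd

# Shortened sample of the tutorial data (first two rows only).
df = pd.DataFrame({'Names': ['Bob', 'Jessica'], 'Births': [968, 155]},
                  columns=['Names', 'Births'])

# Same options as the tutorial call: no index column, no header row.
bare = io.StringIO()
df.to_csv(bare, index=False, header=False)

# Defaults: the index and the column names are both written.
full = io.StringIO()
df.to_csv(full)

bare_lines = bare.getvalue().splitlines()
full_lines = full.getvalue().splitlines()

print(bare_lines)
print(full_lines)
```

With both options turned off the file holds only the raw name and count pairs, which is why the column names have to be supplied again when the file is read back.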

In [52]:
read_csv?

In [56]:
# Relative path: the file written by to_csv above, in the current working directory
Location = 'births1880.csv'
df = pd.read_csv(Location, header=None, names=['Names','Births'])
df


Out[56]:
     Names  Births
0      Bob     968
1  Jessica     155
2     Mary      77
3     John     578
4      Mel     973
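
The header=None and names= arguments matter because the file was written without a header row. The sketch below, a small illustration on an in-memory two-row sample (assuming Python 3 for io.StringIO), shows what happens with and without them. The variable names wrong and right are illustrative only.

```python
import io
import pandas as pd

# Two data rows and no header line, like the file written by to_csv above.
csv_text = "Bob,968\nJessica,155\n"

# Default behaviour: the first row is consumed as the column header.
wrong = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names= supplies the column labels.
right = pd.read_csv(io.StringIO(csv_text), header=None,
                    names=['Names', 'Births'])

print(list(wrong.columns), len(wrong))
print(list(right.columns), len(right))
```

Without header=None the first record silently becomes the column labels, so one row of data is lost.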

In [57]:
import os
os.remove(Location)

In [58]:
'''
Prepare Data
'''
# Check data type of the columns
df.dtypes


Out[58]:
Names     object
Births     int64
dtype: object
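
Beyond the dtype check, the "Prepare Data" step described at the top mentions missing data and inconsistencies. A minimal sketch of such checks follows; the specific checks chosen here (nulls, duplicate names, negative counts) are assumptions for illustration, not steps from the original tutorial.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]},
                  columns=['Names', 'Births'])

# Number of missing values in each column.
missing_per_column = df.isnull().sum()

# Number of names that repeat an earlier row.
duplicate_names = df['Names'].duplicated().sum()

# Number of rows with an impossible (negative) birth count.
negative_births = (df['Births'] < 0).sum()

print(missing_per_column)
print(duplicate_names, negative_births)
```

For this hand-built data set all three counts come out as zero, so no records need special handling.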

In [59]:
'''
Analyze Data
'''
# Check data type of Births column
df.Births.dtype


Out[59]:
dtype('int64')

In [62]:
# Method 1: sort by Births in descending order and take the top row
Sorted = df.sort_values(['Births'], ascending=False)
Sorted.head(1)


Out[62]:
  Names  Births
4   Mel     973

In [63]:
# Method 2: take the maximum of the Births column directly
df['Births'].max()


Out[63]:
973
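
Method 2 returns only the count, not the name that goes with it. One possible way to get both, sketched here as an alternative rather than as the tutorial's own method, is idxmax(), which returns the index label of the largest value; that label can then be passed to .loc to fetch the whole row.

```python
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]},
                  columns=['Names', 'Births'])

# Index label of the row holding the largest Births value.
top_label = df['Births'].idxmax()

# Look the full row up by that label.
top_row = df.loc[top_label]
top_name = top_row['Names']
top_births = top_row['Births']

print(top_label, top_name, top_births)
```

Note that idxmax() reports only the first occurrence if several rows tie for the maximum, whereas the boolean filter used later in this notebook returns all of them.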

In [88]:
'''
Present Data
'''
# Create graph
df['Births'].plot()

# Maximum value in the data set
MaxValue = df['Births'].max()

# Name associated with the maximum value
# (.values returns an array; take its first element to get a plain string)
MaxName = df['Names'][df['Births'] == df['Births'].max()].values[0]

# Text to display on graph
Text = str(MaxValue) + " - " + MaxName

# Add text to graph
plt.annotate(Text, xy=(1, MaxValue), xytext=(8, 0), 
                 xycoords=('axes fraction', 'data'), textcoords='offset points')

print("The most popular name")
df[df['Births'] == df['Births'].max()]


The most popular name
Out[88]:
  Names  Births
4   Mel     973
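
Because the names are categories rather than points on a continuous axis, a bar chart may read more naturally than the line plot above. The following is a sketch of that alternative, not the original tutorial's plot. It assumes the default 0..n-1 index, so that the index label doubles as the bar position, and it selects the non-interactive Agg backend so it can run without a display. The names ax, top, and label_text are illustrative.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line inside a notebook
import pandas as pd

df = pd.DataFrame({'Names': ['Bob', 'Jessica', 'Mary', 'John', 'Mel'],
                   'Births': [968, 155, 77, 578, 973]},
                  columns=['Names', 'Births'])

# One bar per name, labelled with the name on the x axis.
ax = df.plot(kind='bar', x='Names', y='Births', legend=False)
ax.set_ylabel('Births')

# Row with the largest count; with the default index the label is also
# the x position of that bar.
top = df['Births'].idxmax()
label_text = str(df.loc[top, 'Births']) + ' - ' + df.loc[top, 'Names']

# Place the label just above the tallest bar.
ax.annotate(label_text, xy=(top, df.loc[top, 'Births']),
            xytext=(0, 5), textcoords='offset points', ha='center')
```

If the frame had a non-default index, the bar position would have to be computed separately from the label.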